library(tidyverse)
library(janitor)Indoor logs cleaning
This data is the record indoor temperatures at each unit from April to September. This information was acquired by Lauren McGaughy at KUT through a public information request to the Texas Department of Criminal Justice.
The original PDFs were transcribed into data using Google Pinpoint, resulting in the csv files found in data-original as Indoor-30days-edited.csv and Indoor-31days-edited.csv. The original documents are found in data-original/2023_indoor.
Goals of this notebook
What we’ll do to prepare the data:
- Download the data
- Import it into our notebook
- Clean up data types and columns
- Export data into our next notebook
Setup
Downloading data
This data comes from a public information request to the Texas Department of Criminal Justice. We have two seperate sheets with our 30-day months and our 31-day months, so we’ll read each of those in separately.
indoor_temps <- read_csv("data-original/Indoor-30days-edited.csv")Rows: 201 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): File Name, Unit, Validation Link
dbl (30): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
indoor_temps |> glimpse()Rows: 201
Columns: 33
$ `File Name` <chr> "SB1R56 - April 2023_1.pdf", "SB1R56 - April 2023_1.…
$ Unit <chr> "Allred", "Beto", "Boyd", "Bradshaw", "Briscoe", "By…
$ `1` <dbl> 75.5, 73.4, 76.6, 80.1, 84.0, 76.9, 77.5, 71.4, 76.4…
$ `2` <dbl> 79.0, 73.9, 76.5, 82.1, 88.0, 74.0, 79.1, 72.8, 76.1…
$ `3` <dbl> 78.1, 78.6, 81.3, 86.3, 91.2, 81.1, 78.6, 74.7, 80.8…
$ `4` <dbl> 80.5, 79.4, 82.5, 80.5, 88.0, 81.2, 75.6, 73.3, 82.2…
$ `5` <dbl> 74.3, 74.2, 77.0, 73.8, 76.4, 75.4, 81.5, 71.4, 73.8…
$ `6` <dbl> 74.2, 69.9, 68.7, 68.5, 67.7, 72.7, 73.2, 72.3, 66.3…
$ `7` <dbl> 76.1, 69.6, 72.1, 70.2, 69.1, 65.1, 67.0, 73.3, 76.4…
$ `8` <dbl> 77.9, 71.3, 74.2, 78.4, 73.1, 72.3, 67.5, 72.5, 78.5…
$ `9` <dbl> 79.0, 73.2, 74.2, 73.0, 76.5, 69.8, 72.0, 73.2, 75.7…
$ `10` <dbl> 78.4, 71.6, 73.1, 79.1, 75.6, 75.8, 72.2, 73.3, 72.1…
$ `11` <dbl> 78.9, 72.1, 78.8, 78.5, 80.6, 76.9, 71.1, 74.4, 77.1…
$ `12` <dbl> 78.3, 72.5, 77.0, 78.3, 80.0, 73.0, 72.4, 75.0, 77.2…
$ `13` <dbl> 79.7, 72.6, 76.5, 75.8, 81.0, 76.0, 74.7, 74.9, 77.2…
$ `14` <dbl> 80.4, 73.4, 77.2, 85.3, 88.6, 74.0, 70.9, 74.4, 75.8…
$ `15` <dbl> 77.1, 79.2, 80.9, 84.9, 90.6, 73.1, 76.4, 69.8, 79.9…
$ `16` <dbl> 76.3, 70.9, 73.3, 82.0, 88.2, 72.0, 72.6, 67.0, 71.0…
$ `17` <dbl> 77.8, 70.8, 76.1, 82.2, 79.4, 77.2, 73.1, 69.3, 77.0…
$ `18` <dbl> 82.2, 71.9, 72.3, 73.7, 83.0, 73.0, 76.5, 77.6, 72.6…
$ `19` <dbl> 81.3, 74.9, 77.8, 82.1, 80.0, 72.5, 75.6, 72.5, 76.5…
$ `20` <dbl> 76.9, 75.8, 78.9, 80.1, 85.0, 76.0, 75.3, 69.6, 79.8…
$ `21` <dbl> 74.4, 75.9, 74.8, 78.1, 82.5, 73.0, 75.5, 74.1, 76.6…
$ `22` <dbl> 75.4, 72.3, 75.6, 80.7, 81.0, 71.0, 74.2, 73.5, 73.8…
$ `23` <dbl> 77.0, 65.3, 65.0, 72.6, 67.9, 76.8, 66.1, 61.0, 76.4…
$ `24` <dbl> 71.6, 69.1, 73.0, 78.2, 72.0, 76.0, 67.3, 61.7, 76.8…
$ `25` <dbl> 77.1, 70.1, 73.3, 79.7, 77.0, 72.6, 72.9, 64.7, 75.7…
$ `26` <dbl> 76.3, 71.2, 73.1, 74.0, 83.6, 71.5, 77.1, 62.2, 76.7…
$ `27` <dbl> 76.9, 71.4, 73.5, 74.8, 83.2, 67.9, 70.3, 61.9, 78.5…
$ `28` <dbl> 77.2, 71.6, 78.6, 80.5, 86.3, 74.5, 78.9, 60.7, 77.4…
$ `29` <dbl> 71.8, 67.7, 73.0, 76.0, 77.1, 67.1, 68.8, 61.9, 76.3…
$ `30` <dbl> 77.0, 71.7, 81.3, 79.1, 84.3, 69.6, 74.9, 63.9, 78.5…
$ `Validation Link` <chr> "https://journaliststudio.google.com/pinpoint-extrac…
other_indoor_temps <- read_csv("data-original/Indoor-31days-edited.csv")Rows: 203 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): File Name, Unit, Validation Link
dbl (31): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
other_indoor_temps |> glimpse()Rows: 203
Columns: 34
$ `File Name` <chr> "SB1R56 - August 2023_1.pdf", "SB1R56 - August 2023_…
$ Unit <chr> "Allred", "Beto", "Boyd", "Bradshaw", "Briscoe", "By…
$ `1` <dbl> 97.7, 91.9, 95.1, 98.2, 96.0, 93.8, 84.3, 89.3, 97.0…
$ `2` <dbl> 99.7, 94.4, 97.0, 97.7, 94.1, 94.2, 87.4, 89.0, 93.8…
$ `3` <dbl> 95.6, 92.8, 97.4, 98.4, 94.9, 93.4, 85.6, 88.8, 92.7…
$ `4` <dbl> 97.6, 91.1, 96.0, 97.6, 91.7, 93.7, 83.0, 88.0, 94.0…
$ `5` <dbl> 100.5, 93.8, 96.4, 97.8, 95.1, 93.9, 84.2, 89.3, 92.…
$ `6` <dbl> 97.1, 94.7, 94.9, 97.4, 95.3, 90.1, 87.2, 85.1, 93.7…
$ `7` <dbl> 93.8, 92.3, 94.5, 94.5, 95.5, 95.2, 84.9, 82.5, 93.5…
$ `8` <dbl> 92.3, 93.1, 95.3, 95.3, 95.4, 94.0, 89.4, 83.5, 93.1…
$ `9` <dbl> 95.8, 91.3, 94.4, 92.8, 98.1, 92.5, 89.9, 83.9, 94.5…
$ `10` <dbl> 93.0, 93.9, 95.3, 95.9, 97.5, 92.9, 91.7, 84.0, 96.2…
$ `11` <dbl> 99.1, 92.7, 93.2, 95.4, 96.1, 92.8, 90.2, 87.3, 94.2…
$ `12` <dbl> 100.1, 93.7, 95.4, 96.3, 97.2, 93.0, 88.6, 87.2, 94.…
$ `13` <dbl> 97.5, 92.4, 95.6, 95.0, 98.1, 92.6, 86.2, 86.0, 93.8…
$ `14` <dbl> 83.3, 93.2, 95.0, 94.2, 96.3, 92.3, 88.6, 81.4, 94.9…
$ `15` <dbl> 85.9, 89.0, 93.7, 92.1, 94.6, 91.2, 87.9, 81.0, 91.1…
$ `16` <dbl> 87.5, 86.8, 91.1, 89.7, 94.5, 90.2, 89.5, 83.5, 88.8…
$ `17` <dbl> 94.3, 90.3, 94.8, 94.6, 96.7, 90.4, 90.5, 87.9, 89.6…
$ `18` <dbl> 97.9, 93.2, 98.0, 89.7, 94.1, 91.8, 90.1, 82.0, 95.1…
$ `19` <dbl> 95.0, 91.9, 94.9, 95.2, 95.0, 91.9, 89.4, 85.7, 96.5…
$ `20` <dbl> 96.3, 94.9, 97.6, 96.1, 98.2, 92.9, 88.7, 85.9, 99.1…
$ `21` <dbl> 93.4, 94.2, 98.2, 96.1, 93.7, 93.3, 87.2, 87.0, 96.6…
$ `22` <dbl> 95.7, 92.3, 96.1, 98.1, 85.6, 92.1, 84.4, 89.9, 93.6…
$ `23` <dbl> 97.8, 94.3, 96.2, 98.6, 89.5, 91.4, 89.1, 89.4, 94.8…
$ `24` <dbl> 99.1, 95.7, 98.9, 98.5, 92.2, 94.2, 91.7, 86.0, 99.5…
$ `25` <dbl> 95.6, 96.2, 98.0, 98.9, 96.2, 92.2, 90.9, 87.8, 95.6…
$ `26` <dbl> 99.3, 96.3, 98.7, 98.6, 93.2, 95.4, 90.7, 84.8, 99.1…
$ `27` <dbl> 97.3, 96.5, 99.6, 98.1, 96.3, 93.2, 85.7, 83.8, 101.…
$ `28` <dbl> 88.0, 86.3, 89.9, 84.3, 97.1, 86.2, 86.9, 84.0, 86.2…
$ `29` <dbl> 87.7, 87.1, 90.7, 89.3, 93.8, 88.2, 88.9, 81.7, 88.6…
$ `30` <dbl> 87.8, 87.3, 91.8, 89.1, 95.4, 86.5, 88.8, 81.6, 88.8…
$ `31` <dbl> 88.6, 87.6, 91.8, 91.2, 97.2, 81.9, 87.9, 81.0, 88.0…
$ `Validation Link` <chr> "https://journaliststudio.google.com/pinpoint-extrac…
Pivot Longer
Right now each day is a column, let’s change it so days are all under one column and clean up the names of our columns.
We’ll do this individually before we join the columns so we don’t have to remove the 31st rows for months with only 30 days.
other_temps_raw <- other_indoor_temps |>
pivot_longer(
cols = "1":"31",
names_to = "day",
values_to = "temperature"
)
other_temps_rawindoor_temps_raw <- indoor_temps |>
pivot_longer(
cols = "1":"30",
names_to = "day",
values_to = "temperature"
)
indoor_temps_rawCombine data frames
Let’s combine both of our datasets to make a single table.
indoor_temps_combined <- bind_rows(other_temps_raw, indoor_temps_raw)
indoor_temps_combinedClean names
One column we’ll have to convert is our file_name column into a month/year column. In order to do that, we first have to create a value with all of the characters we want to remove.
word_list <- c("SB1R56 - |_1.pdf|_2.pdf")Now that we’ve done that, let’s clean names and make our new column.
daily_indoor <- indoor_temps_combined |>
clean_names() |>
mutate(
month_year = str_remove_all(file_name, word_list)
)
daily_indoorAdd date column
We’ll need to clean up the dates so they’re standardized and readable. We’ll combine our month/year column with the day column to put through the lubridate function and, while we’re at it, remove unnecessary columns.
daily_indoor_dated <- daily_indoor |>
# distinct() |>
unite(
col = "temp_date",
day, month_year,
sep = " "
) |>
mutate(
date = dmy(temp_date)
) |>
select(!c(file_name, validation_link, temp_date)) |>
mutate(unit_temp = temperature) |>
select(!temperature)
daily_indoor_dated |> glimpse()Rows: 12,323
Columns: 3
$ unit <chr> "Allred", "Allred", "Allred", "Allred", "Allred", "Allred", …
$ date <date> 2023-08-01, 2023-08-02, 2023-08-03, 2023-08-04, 2023-08-05,…
$ unit_temp <dbl> 97.7, 99.7, 95.6, 97.6, 100.5, 97.1, 93.8, 92.3, 95.8, 93.0,…
Check dates
daily_indoor_datedClean some unit names
It looks like the Ramsey unit was recorded with slightly different names in each csv. Let’s combine both of those columns so all of the data is together and then remove any leftover NA’s.
Mcconnell 183
O’Daniel 91
Wainwright 183
daily_indoor_named <- daily_indoor_dated |>
mutate(
unit = case_match(
unit,
"Mcconnell" ~ "McConnell",
"O’Daniel" ~ "O'Daniel",
.default = unit
)
)We don’t have recordings for every unit. These are the units we do not have readings for: Baten, Bell, Bridgeport, Coleman, Cotulla, Diboll, Duncan, East Texas, Estes, Fort Stockton, Garza East, Glossbrenner, Halbert, Hamilton, Havins, Henley, Hodge, Hospital Galveston, Kegans, Kyle, LeBlanc, Lindsey, Marlin, Mechler, Moore, B., Ney, Pack, San Saba, Sayle, Scott, Skyview, Travis, Willacy.
Export the data
We did it! Now let’s export our data and place it in our processed data folder.
daily_indoor_named |> write_rds("data-processed/01-indoor-cleaned.rds")